
.. _data-curator-gpu-deduplication:

#######################################################
GPU Accelerated Exact and Fuzzy Deduplication
#######################################################

-----------------------------------------
Background
-----------------------------------------

Training language models on randomly selected documents for many epochs can lead to suboptimal downstream performance.
For more information on when this is harmful, please see `Muennighoff et al., 2023 <https://arxiv.org/abs/2305.16264>`_ and `Tirumala et al., 2023 <https://arxiv.org/abs/2308.12284>`_.
The exact and fuzzy document-level deduplication modules in NeMo Curator aim to reduce the occurrence of duplicate and
near-duplicate documents in the dataset. Exact deduplication removes identical documents (i.e., documents whose strings are equal)
from the dataset, while fuzzy deduplication removes near-identical documents (e.g., an excerpt of one document is used in another document)
from the dataset.

Both functionalities are supported in NeMo Curator and accelerated using `RAPIDS <https://rapids.ai>`_.
Exact deduplication works by hashing each document and keeping only one document per hash.
Fuzzy deduplication is more involved and follows the method outlined in `Megatron-Turing NLG 530B <https://arxiv.org/abs/2201.11990>`_.
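
The hash-and-keep-one idea behind exact deduplication can be illustrated with a minimal CPU-only sketch. This is not the
RAPIDS-accelerated implementation in NeMo Curator. The function name :code:`exact_dedup`, the :code:`text` field, and the choice of MD5 are assumptions made for illustration.

.. code-block:: python

    import hashlib


    def exact_dedup(docs, text_field="text"):
        """Keep the first document seen for each content hash (illustrative only)."""
        seen_hashes = set()
        kept = []
        for doc in docs:
            # Assumption: MD5 of the UTF-8 text; any stable hash would do here.
            digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(doc)
        return kept


    docs = [
        {"id": 0, "text": "the quick brown fox"},
        {"id": 1, "text": "the quick brown fox"},
        {"id": 2, "text": "a different document"},
    ]
    kept = exact_dedup(docs)

In this sketch the second document is dropped because its hash matches the first.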

-----------------------------------------
Usage
-----------------------------------------
Because exact deduplication is a much less involved procedure and requires significantly less compute,
we typically run exact deduplication before fuzzy deduplication. Also, in our experience
deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.

When removing near-duplicates within the corpus, we perform fuzzy deduplication at the document level in order to remove documents that
have high Jaccard similarity. Our approach closely resembles the approach described in `Smith et al., 2022 <https://arxiv.org/abs/2201.11990>`_. This
approach can essentially be split into two conceptual stages. The first stage involves computing MinHash signatures on
documents and then performing Locality Sensitive Hashing (LSH) to find candidate duplicates. Because bucketing via MinHash + LSH is approximate
(`Leskovec et al., 2020 <http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf>`_), the second stage processes each of the buckets to remove any false positives that may have been hashed into them.
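
The first stage can be sketched in plain Python as follows. This is a simplified CPU illustration of MinHash signatures over character n-grams
followed by LSH banding, not the GPU implementation used by NeMo Curator. All function names, the SHA-1 based hash family, and the default parameter values are assumptions.

.. code-block:: python

    import hashlib
    from collections import defaultdict
    from itertools import combinations


    def char_ngrams(text, n):
        """Return the set of character n-grams (shingles) in the text."""
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


    def minhash_signature(text, num_hashes=16, ngram=5, seed=42):
        """For each seeded hash function, keep the minimum hash over all shingles."""
        shingles = char_ngrams(text, ngram)
        signature = []
        for i in range(num_hashes):
            prefix = f"{seed}:{i}:".encode("utf-8")
            signature.append(min(
                int.from_bytes(hashlib.sha1(prefix + s.encode("utf-8")).digest()[:4], "big")
                for s in shingles
            ))
        return signature


    def lsh_candidate_pairs(signatures, num_bands=4):
        """Split each signature into bands; documents sharing a band become candidates."""
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            rows = len(sig) // num_bands
            for band in range(num_bands):
                key = (band, tuple(sig[band * rows:(band + 1) * rows]))
                buckets[key].append(doc_id)
        pairs = set()
        for doc_ids in buckets.values():
            pairs.update(combinations(sorted(doc_ids), 2))
        return pairs


    docs = {
        "a": "GPU accelerated deduplication of large text corpora",
        "b": "GPU accelerated deduplication of large text corpora",
        "c": "An unrelated sentence about cooking pasta at home",
    }
    signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    candidates = lsh_candidate_pairs(signatures)

Identical documents always land in the same buckets. Documents that are only similar do so with a probability that grows with their Jaccard similarity,
which is why the candidates still need to be verified in the second stage.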



Before running either of these modules, users should assign a unique document ID to each document in the corpus.
This can be accomplished using the :code:`add_id` module within NeMo Curator:

.. code-block:: bash

         add_id \
           --input-data-dir=<Path to directory containing jsonl files> \
           --log-dir=./log/add_id

By default, this will create a new field named :code:`adlr_id` within each JSON document, which will have the form "doc_prefix-000001".
If the dataset already has a unique ID this step can be skipped.

**Note**: Fuzzy deduplication only works with numeric IDs or the specific ID format generated by the :code:`add_id` script. If the
dataset does not contain IDs in this format, it is recommended to convert to an integer-based ID or an ID created by the :code:`add_id` script.
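
For datasets that need IDs in the "doc_prefix-000001" form, the format can be illustrated with the short sketch below. It is not the
:code:`add_id` implementation. The helper name :code:`add_ids`, the zero-padding width, and the starting index are assumptions inferred from the example ID above.

.. code-block:: python

    def add_ids(records, id_field="adlr_id", prefix="doc_prefix", width=6, start=1):
        """Return copies of the records with a zero-padded, prefixed ID (illustrative only)."""
        return [
            {**record, id_field: f"{prefix}-{index:0{width}d}"}
            for index, record in enumerate(records, start=start)
        ]


    records = [{"text": "first document"}, {"text": "second document"}]
    with_ids = add_ids(records)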

Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication which roughly require the following
steps (all scripts are included in the :code:`nemo_curator/scripts/` subdirectory):

* Exact dedup

  - Input: Data directories
  - Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.

* Fuzzy Dedup

  1. Compute Minhashes
    - Input: Data Directories
    - Output: minhashes.parquet for each data dir.
    - Example call:

         .. code-block:: bash

                 # same as `python compute_minhashes.py`
                 gpu_compute_minhashes \
                   --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
                   --output-minhash-dir /path/to/output_minhashes \
                   --input-json-text-field text_column_name \
                   --input-json-id-field id_column_name \
                   --minhash-length number_of_hashes \
                   --char-ngram char_ngram_size \
                   --hash-bytes 4 `#Use 4 or 8 byte hashes` \
                   --seed 42 \
                   --log-dir ./
                   # --scheduler-file /path/to/file.json


  2. Buckets (Minhash Buckets)
    - Input: Minhash directories
    - Output: _buckets.parquet
    - Example call:

         .. code-block:: bash

                 # same as `python minhash_lsh.py`
                 minhash_buckets \
                   --input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
                   --output-bucket-dir /path/to/dedup_output \
                   --input-minhash-field _minhash_signature \
                   --input-json-id-field id_column_name \
                   --minhash-length number_of_hashes \
                   --num-bands num_bands \
                   --buckets-per-shuffle 1 `#Value b/w [1-num_bands]. Higher is better but might lead to oom` \
                   --log-dir ./
                   # --scheduler-file /path/to/file.json

  3. Jaccard Map Buckets
    - Input: _buckets.parquet + Data Dir
    - Output: anchor_docs_with_bk.parquet
    - Example call:

         .. code-block:: bash

                 # same as `python map_buckets.py`
                 jaccard_map_buckets \
                   --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
                   --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
                   --output-dir /path/to/dedup_output \
                   --input-json-text-field text_column_name \
                   --input-json-id-field id_column_name \
                   # --scheduler-file /path/to/file.json

  4. Jaccard Shuffle
    - Input: anchor_docs_with_bk.parquet + Data Dir
    - Output: shuffled_docs.parquet
    - Example call:

         .. code-block:: bash

                 # same as `python jaccard_shuffle.py`
                 jaccard_shuffle \
                   --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
                   --input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
                   --output-dir /path/to/dedup_output \
                   --input-json-text-field text_column_name \
                   --input-json-id-field id_column_name \
                   # --scheduler-file /path/to/file.json

  5. Jaccard compute
    - Input: shuffled_docs.parquet
    - Output: jaccard_similarity_results.parquet
    - Example call:

         .. code-block:: bash

                 # same as `python jaccard_compute.py`
                 jaccard_compute \
                   --shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
                   --output-dir /path/to/dedup_output \
                   --ngram-size char_ngram_size_for_similarity \
                   # --scheduler-file /path/to/file.json

  6. Connected Components
    - Input: jaccard_similarity_results.parquet
    - Output: connected_components.parquet
    - Example call:

         .. code-block:: bash

                 # same as `python connected_components.py`
                 gpu_connected_component \
                   --jaccard-pairs-path /path/to/dedup_output/jaccard_similarity_results.parquet \
                   --output-dir /path/to/dedup_output \
                   --cache-dir /path/to/cc_cache \
                   --jaccard-threshold 0.8
                   # --scheduler-file /path/to/file.json
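
The last two steps of the fuzzy deduplication pipeline (Jaccard compute and connected components) can be sketched conceptually in plain Python.
This is a CPU illustration under stated assumptions, not the code behind :code:`jaccard_compute` or :code:`gpu_connected_component`.
The function names, the union-find approach, and the inclusive comparison against the threshold are all assumptions.

.. code-block:: python

    def char_ngrams(text, n):
        """Return the set of character n-grams in the text."""
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


    def jaccard(text_a, text_b, ngram_size=5):
        """Jaccard similarity: size of the intersection over size of the union."""
        a, b = char_ngrams(text_a, ngram_size), char_ngrams(text_b, ngram_size)
        return len(a & b) / len(a | b)


    def connected_components(scored_pairs, threshold=0.8):
        """Group document IDs linked by pairs whose similarity meets the threshold."""
        parent = {}

        def find(node):
            parent.setdefault(node, node)
            while parent[node] != node:
                parent[node] = parent[parent[node]]  # path halving
                node = parent[node]
            return node

        for doc_a, doc_b, similarity in scored_pairs:
            root_a, root_b = find(doc_a), find(doc_b)
            # Assumption: pairs at or above the threshold are treated as duplicates.
            if similarity >= threshold and root_a != root_b:
                parent[root_b] = root_a

        groups = {}
        for node in list(parent):
            groups.setdefault(find(node), set()).add(node)
        return list(groups.values())


    scored_pairs = [("a", "b", 0.9), ("b", "c", 0.85), ("c", "d", 0.5)]
    groups = connected_components(scored_pairs, threshold=0.8)

Here documents "a", "b", and "c" end up in one group even though "a" and "c" were never compared directly, while "d" stays on its own
because its only pair falls below the threshold.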


In addition to the scripts, there are examples in the :code:`examples` directory that showcase using the Python module
directly in your own code. The directory also has examples of how to remove documents from the corpus using the list of duplicate IDs generated by exact or fuzzy
deduplication.
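
As a rough idea of what the removal step involves, the sketch below keeps one document per duplicate group and drops the rest.
It does not reproduce the code in the examples directory. The helper name :code:`remove_duplicates`, the in-memory record format, and the
representation of duplicate groups as sets of IDs are hypothetical, and the actual output schema of the deduplication scripts may differ.

.. code-block:: python

    def remove_duplicates(records, duplicate_groups, id_field="id"):
        """Keep one document per duplicate group and drop the others (illustrative only)."""
        ids_to_drop = set()
        for group in duplicate_groups:
            # Assumption: keep the smallest ID in each group, drop the remainder.
            _keep, *rest = sorted(group)
            ids_to_drop.update(rest)
        return [record for record in records if record[id_field] not in ids_to_drop]


    records = [{"id": i, "text": f"document {i}"} for i in range(1, 5)]
    duplicate_groups = [{1, 3}, {2}]
    deduplicated = remove_duplicates(records, duplicate_groups)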
